
Make MultiHeadAttention op return attention probabilities #23125

Closed

amancini-N wants to merge 2 commits into microsoft:main from amancini-N:attn-probs-mha

Conversation

@amancini-N
Contributor

Description

Add an additional optional output to the MultiHeadAttention op, allowing it to return the attention probabilities.
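
For context (assuming the standard scaled dot-product attention this op implements), the attention probabilities are the softmax-normalized scores computed per head:

$$
P = \operatorname{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right)
$$

with shape (batch_size, num_heads, sequence_length, total_sequence_length), matching the scratch-buffer allocation discussed in the review below.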

Motivation and Context

@tianleiwu
Contributor

/azp run Windows ARM64 QNN CI Pipeline, Windows x64 QNN CI Pipeline, Windows CPU CI Pipeline, Windows GPU CUDA CI Pipeline, Windows GPU DML CI Pipeline, Windows GPU Doc Gen CI Pipeline, Windows GPU TensorRT CI Pipeline, ONNX Runtime Web CI Pipeline, Linux CPU CI Pipeline, Linux CPU Minimal Build E2E CI Pipeline

@tianleiwu
Contributor

/azp run Linux GPU CI Pipeline, Linux GPU TensorRT CI Pipeline, Linux OpenVINO CI Pipeline, Linux QNN CI Pipeline, MacOS CI Pipeline, orttraining-linux-gpu-ci-pipeline, onnxruntime-binary-size-checks-ci-pipeline, Big Models, Linux Android Emulator QNN CI Pipeline, Android CI Pipeline

@tianleiwu
Contributor

/azp run iOS CI Pipeline, ONNX Runtime React Native CI Pipeline, CoreML CI Pipeline, Linux DNNL CI Pipeline, Linux MIGraphX CI Pipeline, Linux ROCm CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 6 pipeline(s).

@azure-pipelines

Azure Pipelines successfully started running 10 pipeline(s).

@azure-pipelines

Azure Pipelines successfully started running 9 pipeline(s).

```cpp
T* attn_probs_data = nullptr;
if (attn_probs == nullptr) {
  size_t bytes = SafeInt<size_t>(batch_size) * num_heads_ * sequence_length * total_sequence_length * sizeof(T);
  attention_probs = allocator->Alloc(bytes);
```
@tianleiwu
Contributor · Dec 17, 2024

There is no need to allocate extra space if we do not output it. You can follow the handling of output_qk (the temporary result of q*k before softmax) in this function.

If we do not need to output both q*k and softmax(q*k), we can consolidate the two paths with a boolean flag indicating whether the output is taken before or after softmax.
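
A minimal sketch of the suggested handling, assuming `attn_probs` is the optional output `Tensor*` and mirroring the existing `output_qk` pattern (variable names are illustrative, not the PR's actual code):

```cpp
T* attn_probs_data = nullptr;
void* scratch = nullptr;
if (attn_probs != nullptr) {
  // Output requested: write softmax(q*k) straight into the kernel's output
  // buffer, so no extra allocation is needed.
  attn_probs_data = attn_probs->MutableData<T>();
} else {
  // Output absent: keep the probabilities in scratch space, as before.
  size_t bytes = SafeInt<size_t>(batch_size) * num_heads_ *
                 sequence_length * total_sequence_length * sizeof(T);
  scratch = allocator->Alloc(bytes);
  attn_probs_data = reinterpret_cast<T*>(scratch);
}
```

A single boolean flag, as suggested, could then decide whether this buffer receives the scores before softmax (the output_qk case) or after.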

"or present state for self attention value with shape (batch_size, num_heads, total_sequence_length, head_size)",
"T",
OpSchema::Optional)
.Output(3,

You will need to update the documents (you can find the updated documents in the artifacts of the Windows GPU Doc Gen CI Pipeline for this PR).
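
For illustration only, the completed schema entry might read as below; the output name and description are assumptions mirroring the kernel allocation above, not the PR's final wording:

```cpp
    .Output(3,
            "attention_probs",
            "attention probabilities with shape "
            "(batch_size, num_heads, sequence_length, total_sequence_length)",
            "T",
            OpSchema::Optional)
```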

```cpp
auto& key_shape = getInputShape(ctx, 1);
auto& key_seqlen_dim = key_shape.dim()[1];
auto& past_seqlen_dim = getInputShape(ctx, past_key_index).dim()[2];
if (key_seqlen_dim.has_dim_value() && past_seqlen_dim.has_dim_value()) {
```

Add a `!past_present_share_buffer` condition here.
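
A sketch of the requested guard, assuming a boolean `past_present_share_buffer` is in scope (hypothetical placement):

```cpp
// When past/present share a preallocated buffer, dim 2 of past_key is the
// maximum (buffer) length rather than the actual past length, so skip the
// total-length inference in that case.
if (!past_present_share_buffer &&
    key_seqlen_dim.has_dim_value() && past_seqlen_dim.has_dim_value()) {
  int64_t total_sequence_length =
      key_seqlen_dim.dim_value() + past_seqlen_dim.dim_value();
  // ... use total_sequence_length when setting the output shape ...
}
```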

@snnn
Contributor · commented Jul 3, 2025

This pull request has been automatically closed because it has merge conflicts and has been inactive for more than 30 days. Please rebase on the target branch and open a new PR.

@snnn closed this on Jul 3, 2025


Development

Successfully merging this pull request may close these issues.

MultiHeadAttention op shall return attention probabilities
